-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Implement literal np.timedelta64
coding
#10101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@spencerkclark Thanks! I'll try to look into this tomorrow. The major interaction issues are with |
Thanks @spencerkclark for taking a look at this! I left a couple of suggestions. Incidentally, I think everyone will be happier the sooner we can relax the nanosecond precision restriction in Xarray :). |
@spencerkclark I do not immediately see any issues with other coders with the current implementation. Taking @shoyer's suggestion into account, we would need to make sure that xarray always gives preference to |
…rk/xarray into timedelta64-encoding
for more information, see https://pre-commit.ci
Looking at the implementation, maybe we don't need the separate new coder class and could keep this all in one objects? I think the behavior could probably fit into a single coder:
|
Thanks @shoyer—indeed it's probably simpler to have this all live in The remaining awkward bit relates to interaction with masking and scaling. Since we are overloading the dtype encoding, we do not have a way of retaining the numerical dtype of the data that was written to disk during a round trip. I pushed a failing test in 503db4a to provide an example. Maybe I am just not thinking about this the proper way, however. |
For now I ended up punting on this in favor of forbidding providing any other encoding parameters when encoding timedeltas this new way. The previous encoding path supports this, and that functionality is still maintained in this PR. Another approach would be to use a different name for the attribute we assign the timedelta dtype to, say So possibly this is closer now—I added some more tests and laid the groundwork for eventually turning off automatic decoding of variables with time-like units attributes (we would set |
Just out of curiosity: How will a Also, will it ignore the units of |
Thanks for taking an interest in this PR @jorenham—under the hood
Though maybe for CF convention purposes we should record The units will not be ignored, since the goal of this PR is to faithfully roundtrip |
xarray/coding/times.py
Outdated
dtype = np.dtype(dtype) | ||
resolution, _ = np.datetime_data(dtype) | ||
if resolution not in typing.get_args(PDDatetimeUnitOptions): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree to raise here. It would be good to have a solution for the user, how to decode in those cases. Could we reduce this to a warning and use the nearest fitting decoding ("s" or "ns")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kmuehlbauer, I kind of went back and forth on this. While it might technically be possible, I sort of lean toward being strict until someone produces a compelling complaint, since enabling it would make round tripping more complicated. I think the only way this situation could come up would be if someone constructed a dataset with a variable with a timedelta64[non-pandas-resolution]
dtype attribute outside of xarray, which as you suggest seems unlikely since this is somewhat of an xarray-specific encoding approach.
To enable round tripping in those situations, we would need to decide whether to use the dtype
provided in the encoding
dictionary when saving a variable to disk, or whether to use the dtype
of the array. If we used the dtype
from the encoding dictionary (if it existed) without extra guardrails, it would allow users to specify dtype
encodings that might be inconsistent with their data, e.g. timedelta64[s]
for an array of dtype timedelta64[us]
. If they wanted to do that, I would prefer that they do so by casting the data itself, e.g. astype("timedelta64[s]"
, prior to encoding.
Currently this PR takes the simple approach of always using the dtype
of the array and inferring the units
from that. We might consider warning if the user for whatever reason specified dtype
and units
encoding attributes that conflicted with those—right now we just silently ignore them—but in general, my hope is that users would not ever feel the need to try to modify encoding parameters manually with this approach and rather use astype
instead.
Does that make sense? Maybe I'm just being overly reluctant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A compromise could be that we decode and warn that not only the resolution of the array will be different, but also that round tripping will not produce identical results. I pushed that in 0e67a04. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great compromise! Thanks @spencerkclark, 💯!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a very neat implementation of literal timedelta coding! I left one comment, where I expect possible user issues. But those cases should be rare, I guess.
This reverts commit 0929ec4.
This PR implements @shoyer's suggested approach for "literal" coding of
np.timedelta64
values. Accordingly, it provides a pathway for roundtrippingnp.timedelta64
data without aFutureWarning
, and preserves the encoding of variables that were encoded on disk with the previous approach.I still want to reflect a little more on whether we want any more tests, but this seems functional at the moment—i.e. this example runs without a warning1:
@kmuehlbauer let me know if you have any initial thoughts, particularly with respect to possible interaction with other coders.
whats-new.rst
Footnotes
Note I needed to move away from nanosecond resolution here, since creating bytes with
to_netcdf
will attempt to castint64
data toint32
which leads to overflow. ↩